PAC-Bayesian High Dimensional Bipartite Ranking
This paper is devoted to the bipartite ranking problem, a classical
statistical learning task, in a high dimensional setting. We propose a scoring
and ranking strategy based on the PAC-Bayesian approach. We consider nonlinear
additive scoring functions, and we derive non-asymptotic risk bounds under a
sparsity assumption. In particular, oracle inequalities in probability holding
under a margin condition assess the performance of our procedure, and prove its
minimax optimality. We propose an MCMC-flavored algorithm to implement our
method and report its behavior on synthetic and real-life datasets.
An Oracle Inequality for Quasi-Bayesian Non-Negative Matrix Factorization
The aim of this paper is to provide some theoretical understanding of
quasi-Bayesian aggregation methods for non-negative matrix factorization. We derive
an oracle inequality for an aggregated estimator. This result holds for a very
general class of prior distributions and shows how the prior affects the rate
of convergence.
Comment: This is the corrected version of the published paper P. Alquier, B.
Guedj, An Oracle Inequality for Quasi-Bayesian Non-negative Matrix
Factorization, Mathematical Methods of Statistics, 2017, vol. 26, no. 1, pp.
55-67. Since then Arnak Dalalyan (ENSAE) found a mistake in the proofs. We
fixed the mistake at the price of a slightly different logarithmic term in
the boun
Pycobra: A Python Toolbox for Ensemble Learning and Visualisation
We introduce \texttt{pycobra}, a Python library devoted to ensemble learning
(regression and classification) and visualisation. Its main assets are the
implementation of several ensemble learning algorithms, a flexible and generic
interface to compare and blend any existing machine learning algorithm
available in Python libraries (as long as a \texttt{predict} method is given),
and visualisation tools such as Voronoi tessellations. \texttt{pycobra} is
fully \texttt{scikit-learn} compatible and is released under the MIT
open-source license. \texttt{pycobra} can be downloaded from the Python Package
Index (PyPi) and Machine Learning Open Source Software (MLOSS). The current
version (along with Jupyter notebooks, extensive documentation, and continuous
integration tests) is available at
\href{https://github.com/bhargavvader/pycobra}{https://github.com/bhargavvader/pycobra}
and the official documentation website is
\href{https://modal.lille.inria.fr/pycobra}{https://modal.lille.inria.fr/pycobra}
A Quasi-Bayesian Perspective to Online Clustering
When faced with high frequency streams of data, clustering raises theoretical
and algorithmic pitfalls. We introduce a new and adaptive online clustering
algorithm relying on a quasi-Bayesian approach, with a dynamic (i.e.,
time-dependent) estimation of the (unknown and changing) number of clusters. We
prove that our approach is supported by minimax regret bounds. We also provide
an RJMCMC-flavored implementation (called PACBO, see
https://cran.r-project.org/web/packages/PACBO/index.html) for which we give a
convergence guarantee. Finally, numerical experiments illustrate the potential
of our procedure.
Sequential Learning of Principal Curves: Summarizing Data Streams on the Fly
When confronted with massive data streams, summarizing data with dimension
reduction methods such as PCA raises theoretical and algorithmic pitfalls.
Principal curves act as a nonlinear generalization of PCA and the present paper
proposes a novel algorithm to automatically and sequentially learn principal
curves from data streams. We show that our procedure is supported by regret
bounds with optimal sublinear remainder terms. A greedy local search
implementation (called \texttt{slpc}, for Sequential Learning Principal Curves)
that incorporates both sleeping experts and multi-armed bandit ingredients is
presented, along with its regret computation and performance on synthetic and
real-life data.
Wasserstein PAC-Bayes Learning: A Bridge Between Generalisation and Optimisation
PAC-Bayes learning is an established framework to assess the generalisation
ability of learning algorithms during the training phase. However, it remains
challenging to know whether PAC-Bayes is useful to understand, before training,
why the output of well-known algorithms generalises well. We positively answer
this question by expanding the \emph{Wasserstein PAC-Bayes} framework, briefly
introduced in \cite{amit2022ipm}. We provide new generalisation bounds
exploiting geometric assumptions on the loss function. Using our framework, we
prove, before any training, that the output of an algorithm from
\citet{lambert2022variational} has a strong asymptotic generalisation ability.
More precisely, we show that it is possible to incorporate optimisation results
within a generalisation framework, building a bridge between PAC-Bayes and
optimisation algorithms.
PAC-Bayes Generalisation Bounds for Heavy-Tailed Losses through Supermartingales
While PAC-Bayes is now an established learning framework for light-tailed
losses (\emph{e.g.}, subgaussian or subexponential), its extension to the case
of heavy-tailed losses remains largely uncharted and has attracted a growing
interest in recent years. We contribute PAC-Bayes generalisation bounds for
heavy-tailed losses under the sole assumption of bounded variance of the loss
function. Under that assumption, we extend previous results from
\citet{kuzborskij2019efron}. Our key technical contribution is exploiting an
extension of Markov's inequality for supermartingales. Our proof technique
unifies and extends different PAC-Bayesian frameworks by providing bounds for
unbounded martingales as well as bounds for batch and online learning with
heavy-tailed losses.
Comment: New Section 3 on Online PAC-Baye
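The abstract does not spell out the inequality it relies on. A classical result matching the description "an extension of Markov's inequality for supermartingales" is Ville's maximal inequality, stated below as an assumption about the kind of tool involved and not as a quotation from the paper. The symbols $M_t$ and $a$ are introduced here for illustration only.

```latex
% Ville's maximal inequality (classical statement; assumed, not quoted from the paper).
% (M_t)_{t \ge 0}: a non-negative supermartingale; a > 0: any threshold.
\[
  \mathbb{P}\Bigl( \exists\, t \ge 0 : M_t \ge a \Bigr) \;\le\; \frac{\mathbb{E}[M_0]}{a}.
\]
% Markov's inequality is the fixed-time counterpart for a non-negative variable:
\[
  \mathbb{P}\bigl( M_t \ge a \bigr) \;\le\; \frac{\mathbb{E}[M_t]}{a}
  \quad \text{for a fixed } t .
\]
```

The first bound holds uniformly over all times $t$ at the cost of a single threshold, which is what makes this type of inequality convenient for sequential or online settings.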
Differentiable PAC-Bayes Objectives with Partially Aggregated Neural Networks
We make three related contributions motivated by the challenge of training
stochastic neural networks, particularly in a PAC-Bayesian setting: (1) we show
how averaging over an ensemble of stochastic neural networks enables a new
class of \emph{partially-aggregated} estimators; (2) we show that these lead to
provably lower-variance gradient estimates for non-differentiable signed-output
networks; (3) we reformulate a PAC-Bayesian bound for these networks to derive
a directly optimisable, differentiable objective and a generalisation
guarantee, without using a surrogate loss or loosening the bound. This bound is
twice as tight as that of Letarte et al. (2019) on a similar network type. We
show empirically that these innovations make training easier and lead to
competitive guarantees.
Kernel-Based Ensemble Learning in Python
We propose a new supervised learning algorithm, for classification and
regression problems where two or more preliminary predictors are available. We
introduce \texttt{KernelCobra}, a non-linear learning strategy for combining an
arbitrary number of initial predictors. \texttt{KernelCobra} builds on the
COBRA algorithm introduced by \citet{biau2016cobra}, which combined estimators
based on a notion of proximity of predictions on the training data. While the
COBRA algorithm used a binary threshold to declare which training data were
close and to be used, we generalize this idea by using a kernel to better
encapsulate the proximity information. Such a smoothing kernel provides more
representative weights to each of the training points which are used to build
the aggregate and final predictor, and \texttt{KernelCobra} systematically
outperforms the COBRA algorithm. While COBRA is intended for regression,
\texttt{KernelCobra} deals with classification and regression.
\texttt{KernelCobra} is included as part of the open source Python package
\texttt{Pycobra} (0.2.4 and onward), introduced by \citet{guedj2018pycobra}.
Numerical experiments assess the performance (in terms of pure prediction and
computational complexity) of \texttt{KernelCobra} on real-life and synthetic
datasets.
Comment: 11 page
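The contrast between the binary threshold and the smoothing kernel described in this abstract can be sketched as follows. This is a minimal NumPy illustration, not the \texttt{pycobra} implementation: the function names, the Gaussian kernel, the bandwidth parameter, and the fallback to the global mean are all assumptions made for the example. Here \texttt{train\_preds} holds the predictions of each preliminary predictor on the training points, and \texttt{query\_preds} holds their predictions on the new point.

```python
import numpy as np


def cobra_predict(train_preds, train_y, query_preds, eps):
    """Threshold-style aggregation (hypothetical helper).

    Averages the responses of the training points whose predictions are
    within eps of the query's predictions for every preliminary predictor.
    train_preds has shape (n, M); query_preds has shape (M,).
    """
    close = np.all(np.abs(train_preds - query_preds) <= eps, axis=1)
    if not close.any():
        # Assumption: fall back to the global mean when no point is retained.
        return float(train_y.mean())
    return float(train_y[close].mean())


def kernel_cobra_predict(train_preds, train_y, query_preds, bandwidth):
    """Kernel-style aggregation (hypothetical helper).

    Replaces the 0/1 retention rule by Gaussian weights computed from the
    squared Euclidean distance between prediction vectors.
    """
    sq_dist = np.sum((train_preds - query_preds) ** 2, axis=1)
    weights = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    total = weights.sum()
    if total == 0.0:
        # Assumption: same fallback as above if all weights underflow to zero.
        return float(train_y.mean())
    return float(np.dot(weights, train_y) / total)


# Toy data: 3 training points, 2 preliminary predictors.
train_preds = np.array([[0.0, 0.0], [1.0, 1.0], [10.0, 10.0]])
train_y = np.array([0.0, 1.0, 10.0])
query_preds = np.array([0.9, 1.1])

hard = cobra_predict(train_preds, train_y, query_preds, eps=0.2)
soft = kernel_cobra_predict(train_preds, train_y, query_preds, bandwidth=1.0)
sharp = kernel_cobra_predict(train_preds, train_y, query_preds, bandwidth=0.1)
```

With a large bandwidth every training point contributes a little, while a small bandwidth concentrates the weight on the points the threshold rule would retain, so the kernel version interpolates between smooth averaging and hard selection.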
From industry-wide parameters to aircraft-centric on-flight inference: improving aeronautics performance prediction with machine learning
Aircraft performance models play a key role in airline operations, especially
in planning a fuel-efficient flight. In practice, manufacturers provide
guidelines which are slightly modified throughout the aircraft life cycle via
the tuning of a single factor, enabling better fuel predictions. However, this
approach has limitations: in particular, it does not reflect the evolution of each
feature impacting the aircraft performance. Our goal here is to overcome this
limitation. The key contribution of the present article is to foster the use of
machine learning to leverage the massive amounts of data continuously recorded
during flights performed by an aircraft and provide models reflecting its
actual and individual performance. We illustrate our approach by focusing on
the estimation of the drag and lift coefficients from recorded flight data. As
these coefficients are not directly recorded, we resort to aerodynamics
approximations. As a safety check, we provide bounds to assess the accuracy of
both the aerodynamics approximation and the statistical performance of our
approach. We provide numerical results on a collection of machine learning
algorithms. We report excellent accuracy on real-life data and exhibit
empirical evidence to support our modelling, consistent with aerodynamics
principles.
Comment: Published in Data-Centric Engineerin